DOCUMENT RESUME



BO 094 00^ 



95 



TM 003 864



AUTHOR          O'Connor, Patricia; And Others
TITLE           Improving Reliability in Assessment of Technic
                Products.
SPONS AGENCY    National Institutes of Health (DHEW), Bethesda, Md.
                Bureau of Health Manpower Education.
PUB DATE        [Apr 74]
NOTE            9p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (59th,
                Chicago, Illinois, April 1974)

EDRS PRICE      MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS     *Dental Schools; *Evaluation Criteria; *Performance
                Tests; Rating Scales; *Student Evaluation; *Test
                Reliability



ABSTRACT

For instructional materials to be certified as "effective," students
must meet instructional objectives operationalized by criterion
tests. By implication, evaluators must agree when criteria are or are
not met. Fourteen instructors evaluated 10 posterior bridges.
Interjudge agreements for total bridges and individual attributes
were low, as they tend to be whenever dental technic products are
evaluated. A method for developing more reliable rating forms is
described. It consists of: (1) limiting discriminations to the
dichotomous decision "acceptable", "unacceptable"; (2) initially
resolving differences among faculty; and (3) defining characteristics
of acceptability in observable terms, and providing a photographic
example of a minimally acceptable product for each attribute.
(Author)






Improving Reliability in Assessment of Technic Products

Patricia O'Connor, Robert E. Lorey and Arun Garg

The University of Michigan School of Dentistry

Studies indicate that agreement among teachers in assessing the
quality of student technic products is low. Furthermore, the evidence
for improving consistency among raters through training programs has
been discouraging. Low agreement among faculty raters is found not
only in assessments of interjudge reliability, but instructors also
differ widely in severity of grading. Lack of agreement creates
serious problems for students. Equity is compromised since students'
grades are influenced by who evaluates their technic products.
Inconsistent standards reduce the value of diagnostic feedback.
Questionnaire data indicate that students frequently identify the low
level of faculty agreement as an instructor behavior that interferes
most with their learning. For example, one instructor may describe a
cavity preparation as "too deep," a second as "too shallow," and a
third as "acceptable." Stories which may or may not be apocryphal
report students whose work is criticized at a checkpoint taking the
unaltered product to another instructor, or even to the same
instructor, to be told their work is acceptable or even excellent.
One investigator has drawn a parallel between evaluations in technic
courses and studies in which "experimental neurosis" is produced in
laboratory animals by presenting problems in which the correct
response is deliberately varied over trials.



The work reported here was supported in part by Grant ^^D0(?- 12H0-01-02
of the Department of Health, Education, and Welfare, Bureau of Health
Manpower Education.



The problem becomes critical when one attempts to develop and
evaluate instructional materials designed to teach students to make
products which meet performance criteria. In the absence of clear
performance standards, the certification of materials as "effective"
is not possible. Our work indicates that one requisite to producing
acceptable products is learning to discriminate between those that
meet and fail to meet specific criteria. Observations in the
laboratories have shown that in a substantial proportion of
student-faculty interactions students ask faculty whether or not a
particular step in creating a product is satisfactory. Moreover, when
students were asked in a questionnaire if presenting slides showing
ideal and flawed preparations before they were required to make these
preparations themselves would contribute to their learning, over
eighty percent responded affirmatively. Our initial attempts to
provide this instruction have received a positive student response.
Instructional units to teach the attributes of a good product are
required. To teach students the standards for which they should
strive and by which they may evaluate their work necessitates that
faculty first agree upon standards of evaluation.

This paper reports one study in a continuing effort to develop
methods which will result in consistent evaluation among faculty
judging technic products. The product to be judged was the
preparation of a tooth for a posterior bridge. The rating form
normally used in evaluating the preparation comprises six criteria,
each of which is to be applied on a five point scale: clinically
unacceptable, minimally acceptable, adequate, good, and excellent. A
rating of "adequate" implies that the student has met course
objectives for the criterion applied. On the rating form only the
attributes themselves and the rating scale are listed. The purpose of
this study was to test whether providing for reference six color
slides, each of which illustrated "adequate" meeting of one
criterion, would result in improved faculty agreement. Agreement as
used here refers not only to interjudge reliability but also to
reduced variability among instructors in stringency or leniency in
rating.



Method 

Ten faculty members who instruct students in the sophomore course in
preclinical crown and bridge were subjects in the study. The names of
40 students were selected at random from the class list of lUij.
Twenty were assigned at random for evaluation in the first rating
session and the remainder for evaluation in the second rating
session. Preparations were assigned numbers to disguise the identity
of students. Ratings were made individually by instructors under
normal illumination with added light from a tensor lamp. Instructions
to faculty for the first rating session were as follows:



The following items are criteria for evaluating the quality of the
molar preparation for sophomore students' bridges. Please evaluate
each attribute in terms of the five point scale provided. In making
your judgments, please take the orientation that ratings of 3, 4, and
5 indicate that on the attribute rated the student has met the course
objective. Ratings of 1 and 2 indicate that the student has fallen
far short or somewhat short of meeting the course objective.

Please use all parts of the scale and remember that a rating of 3 or
better implies that for the attribute judged the student has met the
course objective. A rating of 2 or 1 implies the student has failed
to meet a course objective.



Ratings were made and recorded. From these ratings we selected six
preparations, each of which would be used to illustrate "adequate"
meeting of one of the six criteria. Faculty consensus was the basis
of selection. Ideally, each selected preparation would have been
rated "3" or "adequate" by all raters. Since complete accord was
achieved in no instance, we selected for each criterion that
preparation for which the mean rating was closest to "3" and for
which variability was minimal. One of the authors (REL), who is a
senior member of the Crown and Bridge Department, photographed each
of the selected preparations from the vantage point at which the
particular characteristic could be best viewed. From the resulting
color slides, in which preparations were equal in size to the
preparations themselves, he and a second member of the Crown and
Bridge Department selected the one which most clearly illustrated
adequate attainment for each of the six criteria.
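
The selection rule just described (mean rating closest to "3", with minimal variability) can be sketched in a few lines. The paper does not say how the two conditions were weighed against each other, so the sketch assumes that closeness of the mean comes first and spread only breaks ties. The function name, the preparation labels, and the ratings are all hypothetical.

```python
from statistics import mean, pstdev

def pick_exemplar(ratings_by_prep, target=3.0):
    """Return the id of the preparation that best represents `target`.

    Assumed ordering (not stated in the paper): smallest distance between
    the mean rating and the target first, smallest spread second.
    """
    def key(item):
        _prep_id, ratings = item
        distance = round(abs(mean(ratings) - target), 9)
        return (distance, pstdev(ratings))
    return min(ratings_by_prep.items(), key=key)[0]

# Hypothetical ratings of three preparations on one criterion by ten raters.
ratings_one_criterion = {
    "prep_07": [3, 3, 3, 4, 2, 3, 3, 3, 3, 3],  # mean 3.0, small spread
    "prep_12": [1, 5, 3, 5, 1, 3, 4, 2, 5, 1],  # mean 3.0, large spread
    "prep_15": [4, 4, 5, 4, 3, 4, 4, 5, 4, 4],  # mean 4.1
}
print(pick_exemplar(ratings_one_criterion))
```

In this made-up data two preparations share a mean of exactly 3, and the one on which the raters disagree least is chosen.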

In the second evaluation session, the six slides were presented
simultaneously in a lighted view box and were labelled with the
attribute illustrated. Instructions to faculty were identical to
those in the initial rating session with the following addition:

The illustrations provided represent faculty consensus of a rating of
"3", i.e., "adequate" attainment for sophomore performance objectives
for the attributes illustrated. In making your ratings, please assume
that the example provided warrants a rating of 3. Use the
illustrations for reference in making ratings of all preparations.

PLEASE CONSIDER ONLY THE INDICATED ATTRIBUTE FOR A GIVEN PREPARATION.
The process to follow in each case is:

Assuming that the photograph illustrates "adequate" attainment for
the attribute in question, what rating should be assigned to the
preparation rated?

- If it is substantially inferior to the preparation shown for that
  attribute, assign a "1".

- If it is somewhat inferior to the preparation shown for that
  attribute, assign a "2".

- If it is equal in quality to the preparation shown for that
  attribute, assign a "3".

- If it is somewhat superior to the preparation shown for that
  attribute, assign a "4".

- If it is substantially superior in quality to the preparation shown
  for that attribute, assign a "5".




Results 

In the results, faculty agreement in the initial session and in the
session with slides is compared, first with regard to interjudge
reliability and second with regard to stringency of standards.

To assess interjudge reliability for the preparations as a whole,
ratings for each instructor were summed over all attributes for each
preparation. The possible range of scores for a given preparation is
from six to thirty. Product moment correlations between each judge's
scores for the twenty preparations and the combined scores for the
remaining nine judges were computed for the first and second rating
sessions. The mean r for the initial rating session was .70 and for
the second .83, showing a mean increase of .13. Of the ten
instructors, r was higher in the second session for eight, the same
for one and lower for one. This result was significant at the .01
level using the Wilcoxon matched-pairs signed ranks test.
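
As an illustration of the judge-versus-rest correlation described above, the following sketch correlates each judge's summed scores with the combined scores of all other judges. It is a minimal reading of the procedure, not the authors' computation. The function names are hypothetical, and the data are invented (three judges and five preparations rather than ten and twenty).

```python
from math import sqrt

def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def judge_vs_rest(totals):
    """totals[j][p] is judge j's summed score (6 to 30) for preparation p.

    Returns one r per judge: that judge's scores against the combined
    scores of every other judge.
    """
    n_preps = len(totals[0])
    result = []
    for j, own in enumerate(totals):
        rest = [sum(t[p] for k, t in enumerate(totals) if k != j)
                for p in range(n_preps)]
        result.append(pearson_r(own, rest))
    return result

# Hypothetical summed scores: 3 judges x 5 preparations.
totals = [
    [12, 18, 24, 15, 27],
    [14, 17, 25, 13, 28],
    [10, 20, 22, 16, 26],
]
rs = judge_vs_rest(totals)
print([round(r, 2) for r in rs], "mean r:", round(sum(rs) / len(rs), 2))
```

Computing the list once per session gives paired r values per judge, which could then be compared with a signed-rank routine such as `scipy.stats.wilcoxon`.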

Since we were interested not only in reliability of assessment as a
whole but also in reliability for each of the six criteria, we
examined faculty agreement for each of the attributes separately. For
this purpose we employed the coefficient k developed by Jacob Cohen.
Ratings were collapsed to three categories: unacceptable (ratings of
1 and 2), acceptable (rating of 3), and superior (ratings of 4 and
5). For each pair of raters, scores were cast in three by three
tables. Agreements consist of those instances in which both raters
have assigned the same ratings for the twenty preparations judged on
a given criterion. The coefficient k is simply the proportion of
agreements that occur after chance agreement is removed from
consideration. In the analysis k's were computed for all pairs of
raters in the initial session and in the slide-present rating
sessions. For each pair of raters the direction of differences in k
between the two rating sessions was recorded. Higher positive k's in
the second session would indicate higher reliability on individual
attributes. Results using the sign test indicated that for three
attributes interjudge reliability was higher in the second rating
session; p values were < .02, < .002 and < .001 for these attributes.
For the remaining three attributes no improvement was shown, nor did
trends even approach statistical significance. The markedly improved
reliability for three variables and its absence on the remaining
three is difficult to attribute to non-systematic factors. Our post
hoc explanations bear on the inadequacies of the slides for the
particular attributes shown. A discussion of our hunches on this
point would however require describing more than you would care to
know about molar chamfer preparations.
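
The two steps of this analysis, Cohen's coefficient on the collapsed categories and a sign test on the direction of change, can be sketched as follows. The function names and the ratings are hypothetical. The paper does not say whether the sign test was one- or two-sided or how ties were handled, so the sketch assumes a two-sided test with ties dropped.

```python
from collections import Counter
from math import comb

def collapse(rating):
    """Map a 1-5 rating to the three categories used in the analysis."""
    if rating <= 2:
        return "unacceptable"
    return "acceptable" if rating == 3 else "superior"

def cohen_kappa(rater_a, rater_b):
    """Cohen's coefficient for two raters' 1-5 ratings of the same items.

    Undefined (division by zero) if chance agreement equals 1, i.e. both
    raters put every item in the same single category.
    """
    a = [collapse(r) for r in rater_a]
    b = [collapse(r) for r in rater_b]
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

def sign_test_p(n_pos, n_neg):
    """Two-sided sign test p-value (assumption: ties already dropped)."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical ratings of ten preparations on one attribute by two raters.
rater_a = [1, 2, 3, 3, 4, 5, 2, 3, 4, 1]
rater_b = [2, 2, 3, 4, 4, 5, 3, 3, 5, 1]
print(round(cohen_kappa(rater_a, rater_b), 3))

# Hypothetical counts of rater pairs whose coefficient rose or fell.
print(sign_test_p(8, 1))
```

Ten raters give 45 rater pairs per attribute, so each sign test in a design like this would be based on at most 45 signed differences.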

We wished also to investigate whether, independent of changes in
interjudge reliability, providing slides to define "adequacy" on each
attribute reduced the differences in stringency or leniency in
grading among instructors. We selected a method of analysis for this
question based upon some reasoning I would like to describe briefly.
If raters differ consistently among themselves in stringency of
grading it follows that on each of the twenty preparations, the
magnitude of the summed numerical score will systematically vary with
the instructor. In the most extreme case the same instructor would
consistently assign the highest score, another consistently assign
the second highest score, another the third highest score and so
forth. In this extreme case, rho's or rank order correlations
computed between scores on any pair of preparations would result in
rho's of 1. To the degree that instructors did not vary in the
stringency of standards applied, the rho's would approach 0.
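
The extreme case in this reasoning can be checked with a small sketch. It computes a rank order correlation as the product moment correlation of the ranks, which is one standard way to obtain Spearman's rho and is assumed here rather than taken from the paper. All names and scores are hypothetical; each list holds five instructors' summed scores for one preparation, in the same instructor order.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho as the product moment correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

# Same instructor is always highest, second highest, and so on.
prep_a = [30, 27, 24, 21, 18]
prep_b = [22, 20, 17, 15, 11]
print(spearman_rho(prep_a, prep_b))

# Instructor order on this preparation is unrelated to that on prep_a.
prep_c = [20, 24, 19, 23, 21]
print(round(spearman_rho(prep_a, prep_c), 2))
```

When the instructors keep the same order on both preparations the coefficient is 1, and when their order on one preparation says little about their order on the other it falls toward 0.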

To test the effects of the experimental treatment on instructor bias,
Spearman rho's were computed between all pairs of preparations. In
this analysis each instructor is a subject and his score on each
preparation the equivalent of a score on some variable or attribute.
For example, the scores might be analogous to twenty tests in
arithmetic or spelling. Rho's were computed for the first rating
session and compared with those computed for the second. The results
of the analysis showed that the median rho for the first rating
session was .52 and for the second .21. When rho's were broken at the
combined median for both sessions, 73% of those for the first and 28%
of those for the second fell in the "high" group. The chi square
resulting from this analysis is 77.66 which is significant beyond any
level recorded in statistical tables. Although the inflated N makes
chi square analysis inapplicable, the obtained differences seem to
imply that the stringency differences among instructors have been
sharply reduced.
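
The median split described above can be sketched as a two by two table (session by high or low) with the usual uncorrected chi square. Whether the authors applied a continuity correction, and how values equal to the median were assigned, is not stated; the sketch assumes no correction and counts only values strictly above the combined median as "high". The function name and the rho values are hypothetical.

```python
from statistics import median

def median_split_chi2(rhos_first, rhos_second):
    """Split both sessions' rho's at their combined median.

    Returns the cut point, the 2x2 counts (first high, first low,
    second high, second low) and the uncorrected chi square. The formula
    divides by the margins, so it fails if no value is above the cut.
    """
    cut = median(rhos_first + rhos_second)
    a = sum(r > cut for r in rhos_first)   # first session, high
    b = len(rhos_first) - a                # first session, low
    c = sum(r > cut for r in rhos_second)  # second session, high
    d = len(rhos_second) - c               # second session, low
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return cut, (a, b, c, d), chi2

# Hypothetical rho's; the study itself would have 190 per session
# (all pairs among twenty preparations).
rhos_first = [0.7, 0.6, 0.5, 0.2]
rhos_second = [0.4, 0.3, 0.1, 0.0]
cut, table, chi2 = median_split_chi2(rhos_first, rhos_second)
print(table, chi2)
```

As the paper itself cautions, rho's computed over all pairs of the same twenty preparations are not independent observations, so a statistic of this kind is better read descriptively than as a formal test.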

Changes in instructor stringency bias between the first and second
session may be discussed descriptively. The standard deviation for
instructor mean ratings in the first session was 2.36 and in the
second 1.28. Similarly the average deviations were 1.72 and .88
respectively. The F ratio could not be used to test the difference in
variability between the two rating sessions because of departures of
scores from normality. In the first session, in fact, the deviations
of instructors' mean scores from the combined mean were bimodally
distributed. For four instructors the deviation was less than half a
point and for two others about a point. However four instructors
showed extreme bias, i.e., two instructors assigned mean scores more
than two and one half points above the combined mean and two others,
mean scores more than two and one half points below the combined
mean.
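
The descriptive measures used here can be computed as in the sketch below. The paper does not say whether the standard deviation was the sample or the population form; the sample form is assumed. The function name and the instructor means are hypothetical.

```python
from statistics import mean, stdev

def spread_of_instructor_means(instructor_means):
    """Describe how far instructors' mean scores sit from the combined mean.

    Returns the combined mean, the sample standard deviation (an
    assumption), the average absolute deviation, and each instructor's
    signed deviation from the combined mean.
    """
    combined = mean(instructor_means)
    deviations = [m - combined for m in instructor_means]
    average_deviation = mean(abs(d) for d in deviations)
    return combined, stdev(instructor_means), average_deviation, deviations

# Hypothetical mean summed scores for three instructors.
combined, sd, avg_dev, devs = spread_of_instructor_means([15.0, 18.0, 21.0])
print(combined, sd, avg_dev, devs)
```

The signed deviations make it easy to see which instructors sit well above or below the combined mean, which is the pattern the paragraph describes.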

In the rating session with slides, the mean ratings of three of the
four deviant instructors were reduced and were within one point of
the combined mean for that session. The most severe grader's scores
were half a point closer to the combined mean, but his mean score was
nevertheless over three points below the combined mean.

The remaining six raters as a group showed no evidence for change. In
the initial session, the mean deviation score for them was .31 and in
the second [illegible]. For these subjects as a whole, the absence of
change is not a serious matter of concern since, in the first
session, they had not shown stringency bias. The dramatic change
shown in the rho analysis is then attributable to the sharp drop in
instructor bias for the three of four instructors whose grades
changed markedly toward the combined group mean.

After the second session, in a questionnaire item, instructors were
asked to predict whether their agreement with the combined faculty
rating would be higher in the first or second session and to state
the reason for their prediction. Eight of the ten predicted
improvement on the second round and two were uncertain. All
instructors who predicted improvement referred to the positive
contribution of common standards for establishing a norm.

Discussion

The results of the study support the positive contribution of slides
illustrating the minimum requirement for meeting a criterion in a
technic preparation to instructor agreement. They encourage the
continuation of efforts to develop instructional materials that will
teach students to learn criteria for evaluation of a product. Faculty
themselves will of necessity be deeply involved in specifying
criteria and selecting models and photographs. It is our hope that
the materials developed will be used not only for student instruction
but for faculty instruction as well. If we are successful in
developing materials for this purpose, we will be able to evaluate
the effectiveness of instructional materials designed to teach
students to produce technic preparations that satisfy standards of
excellence.






